TPM034A Machine Learning for socio-technical systems¶

Lab session 01: Discover, explore and visualise data¶

Delft University of Technology
Q2 2025
Instructor: Sander van Cranenburgh & Francisco Garrido Valenzuela

Instructions¶

Lab session aim to:

  • Show and reinforce how models and ideas presented in class are put to practice.
  • Help you gather hands-on machine learning skills.

Lab sessions are:

  • Learning environments where you work with Jupyter notebooks and where you can get support from TAs and fellow students.
  • Not graded and do not have to be submitted.
  • A good preparation for the assignments (which are graded).


Use of AI tools¶

AI tools, such as ChatGPT and GitHub Copilot, can be valuable aids for programming. In your later careers, you will also work in an environment where such tools are widely available. However, they cannot replace your own reasoning: you still need to conceptualise the problem, break it down, and structure it to conduct proper analysis. We therefore encourage you to use AI tools responsibly — as a support, not a substitute for your own thinking.

⚠️ Important: During the practical exam you will NOT have access to these tools (since internet access will be restricted).

Google Colab workspace set-up¶

Uncomment the following cells code lines if you are running this notebook on Colab

In [1]:
#!git clone https://github.com/TPM034A/Q2_2025
#!pip install -r Q2_2025/requirements_colab.txt
#!mv "/content/Q2_2025/Lab_sessions/lab_session_01/data" /content/data

Application: liveability in the Netherlands

In this lab session, we will explore how liveable the Netherlands is. Liveability is high on the policy agenda. Many policies directly or indirectly aim to improve liveability, such as e.g. lowering traffic speed, curbing noise levels, maintaining public gardens and spaces, etc. But liveability is hard to study due to its elusive and complex subjective nature. In this lab session, we will discover how liveability is spatially distributed, and what factors associate positively or negatively with liveability.

Learning objectives. After completing the following exercises you will be able to:

  1. Construct a data set from multiple sources using pandas
  2. Discover and visualise data, using seaborn and geopandas
  3. Explore associations using scatter plots and correlation heat maps
  4. Conduct linear regression, using sk-learn
In [2]:
# Import required Python packages and modules
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt


# Import selected functions and classes from Python packages
from os import getcwd
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

1. Constructing a data set from multiple data sources

To construct a data set from multiple data sources, we will:

  • A. Load the 3 data sets:
    • A.1. Liveability data
    • A.2. Population statistics data
    • A.3. geospatial data
  • B. Clean the data
  • C. Combine the data
A. Loading data

We first load our primary data source on liveability. There are three liveability data sets, with varying spatial levels:

  • Gemeente Municipality
  • Wijk Neighbourhood
  • Buurt Local neighbourhood
In [3]:
# Get the current working directory
working_folder = getcwd()
data_folder = Path(f'data')
print(data_folder)
data
In [4]:
# Load liveability data, using pandas
gemeente_liv_data = pd.read_csv(data_folder/'gemeente_liveability.csv')
wijk_liv_data = pd.read_csv(data_folder/'wijk_liveability.csv')
buurt_liv_data = pd.read_csv(data_folder/'buurt_liveability.csv')
print(list(gemeente_liv_data.keys()))
['gm_code', 'versie', 'jaar', 'gm_naam', 'lbm', 'afw', 'fys', 'onv', 'soc', 'vrz', 'won']

Description of variables
gm_code   Unique code for gemeente
versie    Version of the 'Leefbaarometer model' used
jaar      Year
gm_naam Name of gemeente

Liveability indicators:
lbm  lbm is the key variable of interest in this lab session. It is composed of the variables below:
afw  Deviation from previous model
fys  Physical environment
onv  Nuisance and insecurity
soc  Social cohesion
vrz  Amenities
won  Housing stock

We load the second source of data. These data contains population statistics.

In [5]:
# Load files on population statistics
gemeente_pop_data = pd.read_csv(data_folder/'gemeente_2020_pop.csv', sep=';')
wijk_pop_data = pd.read_csv(data_folder/'wijk_2020_pop.csv', sep=';')
buurt_pop_data = pd.read_csv(data_folder/'buurt_2020_pop.csv', sep=';')
gemeente_pop_data.keys()
Out[5]:
Index(['GM_CODE', 'GM_NAAM', 'H2O', 'OAD', 'STED', 'BEV_DICHTH', 'AANT_INW',
       'AANT_MAN', 'AANT_VROUW', 'P_00_14_JR',
       ...
       'AF_PODIUM', 'AV5_PODIUM', 'AV10PODIUM', 'AV20PODIUM', 'AF_MUSEUM',
       'AV5_MUSEUM', 'AV10MUSEUM', 'AV20MUSEUM', 'JRSTATCODE', 'JAAR'],
      dtype='object', length=178)

Finally, we load data on the geospatial data of gemeenten, wijken and buurten. To do this, load the shape files using geopandas.

In [6]:
# Load shape files
gemeente_shape = gpd.read_file(data_folder/'gemeente/gemeente 2020.shp')
wijk_shape = gpd.read_file(data_folder/'wijk/wijk 2020.shp')
buurt_shape = gpd.read_file(data_folder/'buurt/buurt 2020.shp')
print(buurt_shape.info())
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 13808 entries, 0 to 13807
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   bu_code   13808 non-null  object  
 1   bu_naam   13808 non-null  object  
 2   gm_naam   13808 non-null  object  
 3   geometry  13808 non-null  geometry
dtypes: geometry(1), object(3)
memory usage: 431.6+ KB
None
In [7]:
# Ensure that the shape files are all in the right projection
gemeente_shape.to_crs(epsg=28992, inplace=True)
wijk_shape.to_crs(epsg=28992, inplace=True)
buurt_shape.to_crs(epsg=28992, inplace=True)

# Print crs info
print(gemeente_shape.crs)
print(wijk_shape.crs)
print(buurt_shape.crs)
EPSG:28992
EPSG:28992
EPSG:28992
B. Cleaning data¶

First, we clean the liveability data sets.
More specifically, we:

  • Extract only those data points for which we have a liveability score.
  • Extract only those data points that are created by Leefbaarometer 3.0 and for year 2020.
In [8]:
# Remove rows where lbm is NaN
gemeente_liv_data.dropna(subset = ['lbm'], inplace = True)

# Extract subsets: only year 2020 and Leefbaarometer 3.0 data
gemeente_liv_data = gemeente_liv_data.loc[(gemeente_liv_data.versie ==  'Leefbaarometer 3.0') & (gemeente_liv_data.jaar == 2020)]
wijk_liv_data.dropna(subset = ['lbm'], inplace = True)
wijk_liv_data = wijk_liv_data.loc[(wijk_liv_data.versie ==  'Leefbaarometer 3.0') & (wijk_liv_data.jaar == 2020)]
buurt_liv_data.dropna(subset = ['lbm'], inplace = True)
buurt_liv_data = buurt_liv_data.loc[(buurt_liv_data.versie ==  'Leefbaarometer 3.0') & (buurt_liv_data.jaar == 2020)]

Next, we clean the population data sets.
More specifically, we:

  • Remove useless data points (i.e. water areas, opposed to land areas)
  • Drop data points where the percentage of owner occupied homes and rented homes equals 0 (for later analyses)
In [9]:
def clean_pop_data(pop_data):

    # Extract subsets: only data where is land
    pop_data  = pop_data.loc[pop_data.H2O == 'NEE'] 

    # Remove rows where P_KOOPWON == 0 or P_HUURWON == 0
    pop_data = pop_data.loc[(pop_data.P_KOOPWON>0) | (pop_data.P_HUURWON>0)]

    return pop_data

gemeente_pop_data = clean_pop_data(gemeente_pop_data)
wijk_pop_data = clean_pop_data(wijk_pop_data)
buurt_pop_data = clean_pop_data(buurt_pop_data)
C. Combining data¶

Now, we pull together the population and liveability data sets, using pandas' merge function. We use GM_CODE, WK_CODE and BU_CODE as key indices to merge on.
For the merge operation, the keys must be the same. Therefore, we first capitalise the keys of the shape files to be the same as in the other files.

In [10]:
# Capitalise column name of variables to merge on (to be the same across data sets)
gemeente_liv_data.rename(columns={"gm_code": "GM_CODE"}, inplace=True)
wijk_liv_data.rename(columns={"wk_code": "WK_CODE"}, inplace=True)
buurt_liv_data.rename(columns={"bu_code": "BU_CODE"}, inplace=True)

# Merge DataFrames
gemeente_data = gemeente_pop_data.merge(gemeente_liv_data, on ="GM_CODE", how = "inner")
wijk_data = wijk_pop_data.merge(wijk_liv_data, on="WK_CODE", how = "inner")
buurt_data = buurt_pop_data.merge(buurt_liv_data, on="BU_CODE", how = "inner")

Next, we add the geospatial (shape) data, using merge. Again, we use GM_CODE, WK_CODE, and BU_CODE as indices.
For the merge operation, the keys must be the same. Therefore, we first capitalise the keys of the shape files.

In [11]:
# Capitalise column name of key variables
gemeente_shape = gemeente_shape.rename(columns={"gm_code": "GM_CODE"}).drop(columns = 'gm_naam')
wijk_shape = wijk_shape.rename(columns={"wk_code": "WK_CODE"}).drop(columns = 'wk_naam')
buurt_shape = buurt_shape.rename(columns={"bu_code": "BU_CODE"}).drop(columns = 'bu_naam')

# Merge the GeoDataFrame and the DataFrame
gemeente_df = gemeente_shape.merge(gemeente_data, on="GM_CODE")
wijk_df = wijk_shape.merge(wijk_data, on="WK_CODE")
buurt_df = buurt_shape.merge(buurt_data, on="BU_CODE")

Exercise 1: Exploring the data¶

A Load data file Gemeenten alfabetisch 2020.excel into a pandas DataFrame, using pd.read_excel().

B Explore the content, e.g. using .describe(), .head(), and .info()

C Add the provincies (i.e. the column Provincienaam) of the Netherlands to the gemeente_df DataFrame, the wijk_df DataFrame. and the buurt_df DataFrame.
To do so, you need to merge two data set, based on the GM_CODE

In [12]:
# CODE YOUR ANSWERS HERE (Use as many cells as you need)
# A
gemeente2provincie_mapping2020 = pd.read_excel(data_folder/'gemeente2provincie_mapping2020.xlsx')
In [13]:
# B
gemeente2provincie_mapping2020.describe()
gemeente2provincie_mapping2020.head()
Out[13]:
Gemeentecode GM_CODE Gemeentenaam Provinciecode ProvinciecodePV Provincienaam
0 1680 GM1680 Aa en Hunze 22 PV22 Drenthe
1 358 GM0358 Aalsmeer 27 PV27 Noord-Holland
2 197 GM0197 Aalten 25 PV25 Gelderland
3 59 GM0059 Achtkarspelen 21 PV21 Friesland
4 482 GM0482 Alblasserdam 28 PV28 Zuid-Holland
In [14]:
# C
gemeente_df = gemeente_df.merge(gemeente2provincie_mapping2020,on='GM_CODE')
wijk_df = wijk_df.merge(gemeente2provincie_mapping2020,on='GM_CODE')
buurt_df = buurt_df.merge(gemeente2provincie_mapping2020,on='GM_CODE')
buurt_df['Provincienaam'].head()
Out[14]:
0    Groningen
1    Groningen
2    Groningen
3    Groningen
4    Groningen
Name: Provincienaam, dtype: object

2. Discovering and visualising the data¶

After having constructed our data set, it is time to explore the data more in depth. To do so, we use pandas describe, head and info functions. Additionally, to visualise the data, we create histograms (empirical cumulative density function plots). This is done using seaborn package.

Let's first look at the key variable of interest: lbm (liveability), and visualise its ditribution using a histogram and a cumulative density function plot.

In [15]:
# Create histogram and empirical CDF for variable(s) of interest
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharex=True)
sns.histplot(ax = axes[0],x = wijk_df.lbm)
ecdf_data = sns.ecdfplot(ax = axes[1],x = wijk_df.lbm)
axes[0].set_xlabel("lbm score")
axes[1].set_xlabel("lbm score")
axes[1].grid(True,linewidth = 0.5)
axes[1].minorticks_on()
axes[1].grid(which='minor', linestyle=':', linewidth='0.5', color='black')
No description has been provided for this image

Interpretation of results

  • Liveability of wijken is approximately normally distributed (no skew / higher moments)
  • The mean and median liveability of wijken is ~4.15
  • The domain of liveability is ~ [3.7-4.6]

Exercise 2: Distribution of social cohesion¶

Besides liveability, in the liveability data there is also a variable on social cohesion soc.
A Conduct several descriptive analyses, such as histograms, boxplots, etc, on the variable soc
B Interpret your results

In [16]:
# CODE YOUR ANSWERS HERE (Use as many cells as you need)
# A
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharex=True)
sns.histplot(ax = axes[0],x = wijk_df.soc)
ecdf_data = sns.ecdfplot(ax = axes[1],x = wijk_df.soc)
axes[0].set_xlabel("social cohesion")
axes[1].set_xlabel("social cohesion")
axes[1].grid(True,linewidth = 0.5)
axes[1].minorticks_on()
axes[1].grid(which='minor', linestyle=':', linewidth='0.5', color='black')
No description has been provided for this image
In [17]:
# B
# There seems to be a large spread. Furthermore, we see that this variable is 0 centrered

An important step when discovering data is to evaluate their face validity. To do so, let's see whether to top-rated wijken seems to make sense.

In [18]:
# List top-5 best wijken, based on liveability
wijk_df.nlargest(5,'lbm').head(5)
Out[18]:
WK_CODE gm_naam geometry WK_NAAM GM_CODE GM_NAAM IND_WBI H2O OAD STED ... fys onv soc vrz won Gemeentecode Gemeentenaam Provinciecode ProvinciecodePV Provincienaam
808 WK036302 Amsterdam POLYGON ((120668.992 487465.601, 120670.593 48... Grachtengordel-West GM0363 Amsterdam 1 NEE 10678 1 ... 0.003124 -0.057002 -0.057766 0.612074 0.025129 363 Amsterdam 27 PV27 Noord-Holland
828 WK036322 Amsterdam POLYGON ((120016.875 486171.765, 120130.694 48... Vondelbuurt GM0363 Amsterdam 1 NEE 9724 1 ... -0.016662 -0.032454 -0.051559 0.594127 0.027747 363 Amsterdam 27 PV27 Noord-Holland
809 WK036303 Amsterdam POLYGON ((120757.415 486467.356, 120750.405 48... Grachtengordel-Zuid GM0363 Amsterdam 1 NEE 9272 1 ... 0.022858 -0.097292 -0.040515 0.607788 0.027745 363 Amsterdam 27 PV27 Noord-Holland
119 WK009600 Vlieland MULTIPOLYGON (((138217.745 590528.676, 138315.... Wijk 00 GM0096 Vlieland 1 NEE 208 5 ... 0.376499 0.122632 0.033750 -0.043568 0.008214 96 Vlieland 21 PV21 Friesland
1291 WK051810 's-Gravenhage POLYGON ((79575.76 456937.04, 79578.4 456936.4... Wijk 10 Zorgvliet GM0518 's-Gravenhage 1 NEE 4163 1 ... 0.087135 0.072620 0.014838 0.233574 0.087249 518 's-Gravenhage 28 PV28 Zuid-Holland

5 rows × 198 columns

Exercise 3: Face validity of worst neighbourhoods (wijken)¶

The best scoring neighbourhoods seem to make sense. Let's also look at the worst scoring neighbourhoods.
A List the 10 neighourhoods (wijken) which score lowest on lbm

In [19]:
# CODE YOUR ANSWERS HERE (Use as many cells as you need)
# A
wijk_df.nsmallest(5,'lbm').head(10)
Out[19]:
WK_CODE gm_naam geometry WK_NAAM GM_CODE GM_NAAM IND_WBI H2O OAD STED ... fys onv soc vrz won Gemeentecode Gemeentenaam Provinciecode ProvinciecodePV Provincienaam
69 WK008050 Leeuwarden POLYGON ((183883.611 580216.812, 183875.967 58... Heechterp & Schieringen GM0080 Leeuwarden 1 NEE 2395 2 ... 0.032619 -0.241097 -0.078192 -0.009183 -0.064056 80 Leeuwarden 21 PV21 Friesland
1311 WK051830 's-Gravenhage POLYGON ((79514.87 453802.13, 79503.54 453827.... Wijk 30 Transvaalkwartier GM0518 's-Gravenhage 1 NEE 7941 1 ... -0.041024 -0.284179 -0.117140 0.234298 -0.146054 518 's-Gravenhage 28 PV28 Zuid-Holland
735 WK034403 Utrecht POLYGON ((134335.839 458999.112, 134250.809 45... Wijk 03 Overvecht GM0344 Utrecht 1 NEE 3276 1 ... -0.030176 -0.246043 -0.087676 0.103762 -0.074552 344 Utrecht 26 PV26 Utrecht
1160 WK047912 Zaanstad POLYGON ((118678.807 495146.424, 118695.683 49... Wijk 12 Poelenburg GM0479 Zaanstad 1 NEE 2176 2 ... -0.008356 -0.185486 -0.073643 0.061053 -0.121695 479 Zaanstad 27 PV27 Noord-Holland
1916 WK085539 Tilburg POLYGON ((130930.719 399235.829, 131258.299 39... Wandelbos Noord GM0855 Tilburg 1 NEE 2018 2 ... -0.019296 -0.158182 -0.055536 -0.023011 -0.065888 855 Tilburg 30 PV30 Noord-Brabant

5 rows × 198 columns

As we have added geospatial data to the dataframe, we can visualise the data in space.
To further discover the data, let's first look at the municipality level. We color municipalities based on the average liveability score.

In [20]:
# Plot the liveability in the Netherlands at the wijk-level
fig, ax = plt.subplots(figsize=(60,60))
wijk_df.plot(ax=ax, column = 'lbm', legend = True)
gemeente_df.plot(ax=ax, color  = 'none', edgecolor='black')
_=gemeente_df.apply(lambda x: ax.annotate(text=x['GM_NAAM'], xy=x.geometry.centroid.coords[0], ha='center'), axis=1)
# gemeente_df.apply(lambda x: '' if x['AANT_INW']<50000 else ax.annotate(text=x['GM_NAAM'], xy=x.geometry.centroid.coords[0], ha='center'), axis=1)
No description has been provided for this image

Interpretation of results

  • Wijken located near the coast (including the islands) score above average.
  • Larger cities (e.g. AMS, RT, Tilburg, Utrecht, The Hague) seem to have multiple very poor performing wijken
  • In some municipalities, data are patchy (missing)

Next, we do the same, but at the lowest spatial level (buurten) focusing on the municipality of Delft.

In [21]:
# Plot the liveability in Delft at the buurt-level
fig, ax = plt.subplots(figsize=(40,40))
buurt_df.loc[buurt_df.GM_NAAM == 'Delft'].plot(ax=ax, column = 'lbm', legend = True)
gemeente_df.loc[gemeente_df.GM_NAAM == 'Delft'].plot(ax=ax, color  = 'none', edgecolor='black')
buurt_df.loc[buurt_df.GM_NAAM == 'Delft'].apply(lambda x: ax.annotate(text=x['BU_NAAM'], xy=x.geometry.centroid.coords[0], ha='center'), axis=1)
plt.show()
No description has been provided for this image

Exercise 4: Visualise local neighbourhoods in Zuid-Holland¶

A Plot the liveability score in Zuid-Holland at the buurt level

In [22]:
# CODE YOUR ANSWERS HERE (Use as many cells as you need)
# A

fig, ax = plt.subplots(figsize=(40,40))
buurt_df.loc[buurt_df.Provincienaam == 'Zuid-Holland'].plot(ax=ax, column = 'lbm', legend = True)
gemeente_df.loc[gemeente_df.Provincienaam == 'Zuid-Holland'].plot(ax=ax, color  = 'none', edgecolor='black')
gemeente_df.loc[gemeente_df.Provincienaam == 'Zuid-Holland'].apply(lambda x: ax.annotate(text=x['GM_NAAM'], xy=x.geometry.centroid.coords[0], ha='center'), axis=1)
plt.show()
No description has been provided for this image

3. Exploring associations¶

To explore associations, we use correlation heat maps, scatter plots and boxplots

In [23]:
# heatmap of correlations
# Create plot
fig, axes = plt.subplots(figsize=(10, 10))
fig.set_tight_layout(True)

# Compute correlation matrix
corr = wijk_df[['BEV_DICHTH','AANT_INW','AF_SUPERM','AF_RESTAU','P_KOOPWON','P_HUURWON', 'fys', 'onv', 'soc', 'vrz', 'won','lbm']].corr()

# Create upper triangular matrix to mask the upper triangular part of the heatmap
corr_mask = np.triu(np.ones_like(corr, dtype=bool))

# Generate a custom diverging colormap (because it looks better)
corr_cmap = sns.diverging_palette(230, 20, as_cmap=True)

sns.heatmap(corr, mask = corr_mask, cmap=corr_cmap, annot=True,square = True, linewidths=.5, ax = axes)
plt.show()
No description has been provided for this image

Interpretation of the results

  1. Pearson corr coefficients range from [-0.99,0.86]
  2. P_HUURWON and P_KOOPWON correlate (almost) 1 to 1 (which makes sense).

Looking at the bottom row for lbm, we see that:

  1. BEV_DICHTH, AANT_INW, P_HUURWON correlate negatively with liveability
  2. AF_SUPERM, AF_RESTAU, P_KOOPWON, fys, onv, soc, vrz, won correlate positively with liveability
  3. lbm correlates strongest with onv (nuisance and insecurity), soc (social cohesion), won (housing stock)
In [24]:
# Further explore linear associations using scatter plots
# plot
fig, axes = plt.subplots(3, 2, figsize=(20, 10))
fig.set_tight_layout(True)
sns.scatterplot(ax = axes[0,0],x = wijk_df.BEV_DICHTH, y = wijk_df.lbm, alpha = 0.2)
sns.scatterplot(ax = axes[0,1],x = wijk_df.AANT_INW, y = wijk_df.lbm, alpha = 0.2)   
sns.scatterplot(ax = axes[1,0],x = wijk_df.AF_SUPERM, y = wijk_df.lbm, alpha = 0.2)
sns.scatterplot(ax = axes[1,1],x = wijk_df.AF_RESTAU, y = wijk_df.lbm, alpha = 0.2)
sns.scatterplot(ax = axes[2,0],x = wijk_df.P_KOOPWON[wijk_df.P_KOOPWON>0], y = wijk_df.lbm[wijk_df.P_KOOPWON>0], alpha = 0.2)   
sns.scatterplot(ax = axes[2,1],x = wijk_df.P_HUURWON[wijk_df.P_HUURWON>0], y = wijk_df.lbm[wijk_df.P_HUURWON>0], alpha = 0.2)   


axes[0,0].set_xlabel('Population density [#/km^2]')
axes[0,1].set_xlabel('Population size [#]')
axes[1,0].set_xlabel('Distance to supermarket [km]')
axes[1,1].set_xlabel('Distance to restaurant [km]')
axes[2,0].set_xlabel('Percentage owner occupied homes [%]')
axes[2,1].set_xlabel('Percentage rental homes [%]')
Out[24]:
Text(0.5, 0, 'Percentage rental homes [%]')
No description has been provided for this image

Interpretation of the results

  • The scatter plots show the bivariate relations with liveability is strongest for P_KOOPWON and P_HUURWON.
  • For BEV_DICHTH, AANT_INW, AF_RESTAU and P_KOOPWON the data seems skewed, condensed around x = 0.

Boxplot analysis

When one of the features is categorical a boxplot can be insightful.
Boxplots are informative as they also reveal insights on the locality, spread and skewness of numerical data through their quartiles, see https://en.wikipedia.org/wiki/Box_plot.
Below we create a boxplot showing the liveability across a selected number of municipalities based on the liveabilities scores at the buurt level. Add your own municipality to see if the liveability scores are in line with your own perceptions.

In [25]:
# Box plot
fig, axes = plt.subplots(1, 1, figsize=(15, 10), sharex=True)
GM_select = ['Amsterdam','Rotterdam','Utrecht',"'s-Gravenhage",'Groningen','Tilburg']
df_GM_select = buurt_df[buurt_df['GM_NAAM'].isin(GM_select)]
sns.boxplot(ax = axes, y = df_GM_select.lbm, x = df_GM_select.GM_NAAM, orient="v" )

axes.grid(True,linewidth = 0.5)
plt.show()
No description has been provided for this image

Interpretation of results

  • Looking at the median scores, Amsterdam and Groninger are the best municipalities to live in.
  • Unlike Groningen, Amsterdam also has very unliveable buurten. This can be seen from the outliers at the lower side.
  • Larger cities seem to have a higher variance in the liveability scores across buurten (which is to be expected).

Exercise 5: Scatter the median liveability scores of gemeentes based on wijk and buurt level data¶

A Create a scatter plot in which the median liveability score of the gemeente as computed using wijk level data is at the x-axis and the median liveability score as computed using buurt level data is at the y-axis. (hint: use the pandas method groupby)
B Interpret the results

In [26]:
# CODE YOUR ANSWERS HERE (Use as many cells as you need)
# A
# Get median values per gemeente
median_lbm_on_wk = wijk_df.groupby('GM_CODE')['lbm'].median()
median_lbm_on_bu = buurt_df.groupby('GM_CODE')['lbm'].median()

fig, axes = plt.subplots(1, 1, figsize=(25, 25))
sns.scatterplot(x = median_lbm_on_wk, y = median_lbm_on_bu)
axes.set_aspect('equal', adjustable='box')
axes.grid(True,linewidth = 0.5)

# Add diagonal line
x_diag = [min(median_lbm_on_bu),max(median_lbm_on_bu)]
y_diag = [min(median_lbm_on_bu),max(median_lbm_on_bu)]
sns.lineplot(x = x_diag, y = x_diag,linestyle = ':',color = 'k')

# Loop through the data points 
for i, gem_name in enumerate(gemeente_df.GM_CODE):
    plt.text(median_lbm_on_wk.iloc[i],median_lbm_on_bu.iloc[i],gemeente_df.GM_NAAM.iloc[i])
No description has been provided for this image

4. Conducting multiple linear regression analyses¶

Regression models are often used to obtain first insights on the relationship between a (scalar) target and one or more features. Regression models come in many different forms: e.g. simple, multiple, ordinal, Poisson, etc. Usually, a multiple linear regression model serves as a good benchmark.

In ML the focus is usually on generalisation. That is, the model performance is evaluated by looking at the performance on an unseen set of the data (generalisation performance).

For these analyses, we use the sk-learn Python library. This Python library is especially designed machine learning. Having seen the scatter plots and bar plots, we focus on the following 7 features: ANNT_INW, BEV_DICHTH, P_KOOPWON, fys, soc, won, and onv

In [27]:
# Set the seed number for reproducibility. It governs which data points go to the test set and which go to the training set
rs = 12345

# Create X
X_lin = pd.DataFrame([buurt_df["AANT_INW"].div(100000),buurt_df["BEV_DICHTH"].div(10000), buurt_df["P_KOOPWON"].div(100),buurt_df["fys"],buurt_df["soc"],buurt_df["won"],buurt_df["onv"]]).transpose()


# Create Y
Y = buurt_df.lbm

# Create linear regression object
regr = LinearRegression(fit_intercept = True)

# Split the data in a train and test set
X_train, X_test, Y_train, Y_test = train_test_split(X_lin, Y, test_size=0.25,random_state = rs)

# Fit the model on the training data
regr.fit(X_train,Y_train)

# Evaluate the model generalisation performance on the Train and Test data sets
Y_pred_train = regr.predict(X_train)
mse_train = mean_squared_error(Y_train,Y_pred_train)
R2_train = r2_score(Y_train,Y_pred_train)

Y_pred_test = regr.predict(X_test)
mse_test  = mean_squared_error(Y_test, Y_pred_test)
R2_test = r2_score(Y_test,Y_pred_test)

# Print results
print('Results linear multiple regression model')
print(f'Mean Squared Error Train | Test: \t{mse_train:>7.4f}\t|  {mse_test:>7.4f}')
print(f'R2                 Train | Test: \t{R2_train:>7.4f}\t|  {R2_test:>7.4f}\n')
print('Weights')
print(f'Intercept: \t\t\t {regr.intercept_:>7.4f}')
for n in range(len(regr.coef_)):
    print(f'Weight_{X_lin.keys()[n]:10s} \t\t {regr.coef_[n]:>7.4f}')
Results linear multiple regression model
Mean Squared Error Train | Test: 	 0.0066	|   0.0071
R2                 Train | Test: 	 0.5991	|   0.5820

Weights
Intercept: 			  4.0516
Weight_AANT_INW   		 -0.0626
Weight_BEV_DICHTH 		  0.1351
Weight_P_KOOPWON  		 -0.0259
Weight_fys        		  0.7715
Weight_soc        		  0.3782
Weight_won        		  1.1780
Weight_onv        		  0.6743

Interpretation of results

  • Looking at the MSE and R2, we see that the performance on the test data set is roughly equal to the performance on the train data set. This tells us that the model is not overfitting.
  • An advantage of regression models is that their weights are interpretable. For instance, the positive and comparatively large weight associated with BEV_DICHTH tells us that high population density is associated with high liveability.

Exercise 6: Impact of seed number on the regression results¶

A Re-do the regression using different seed numbers (e.g. 1,2,3,4. etc) and reflect on the stability of the results

In [28]:
# CODE YOUR ANSWERS HERE (Use as many cells as you need)
# A

Exercise 7: Generalisation¶

A Train a new extended multiple regression model. To do so, add 13 extra features to the model that could be associated with liveability. For this extended model, compute the Mean Squared Error and R2 both for the train and test data sets.
B The picture below shows the Bias-Variance trade-off. Looking at the MSE and R2 of the model with 7 features and the extended model (with 20 features), where do you think these two models can be placed in the picture (Region A, B, C, D, or E)? Explain your answer.


Bias_variance_tradeoff

In [29]:
# CODE YOUR ANSWERS HERE (Use as many cells as you need)
# A
# Your code here
# Use as many cells as you need

# A

# Create data frames
frames1 = pd.DataFrame([buurt_df["AANT_INW"].div(100000),buurt_df["BEV_DICHTH"].div(10000), buurt_df["P_KOOPWON"].div(100),buurt_df["fys"],buurt_df["soc"],buurt_df["won"],buurt_df["onv"]]).transpose()
frames2 = buurt_df[['P_15_24_JR','P_STERFT','P_GESCHEID','STED','WOZ','P_MAROKKO','P_ANT_ARU','P_SURINAM','P_TURKIJE','P_1GEZW','P_MGEZW','OAD','AANT_VROUW']]

X_lin_extended = pd.concat([frames1,frames2],axis=1)
X_lin_extended.replace(-99999999,np.nan, inplace = True)
X_lin_extended.fillna(X_lin_extended.mean(),inplace = True)

# Create Y
Y = buurt_df.lbm

# Create linear regression object
regr = LinearRegression(fit_intercept = True)

# Split the data in a train and test set
X_train, X_test, Y_train, Y_test = train_test_split(X_lin_extended, Y, test_size=0.25,random_state = rs)

# Fit the model on the training data
regr.fit(X_train,Y_train)

# Evaluate the model generalisation performance on the Train and Test data sets
Y_pred_train = regr.predict(X_train)
mse_train = mean_squared_error(Y_train,Y_pred_train)
R2_train = r2_score(Y_train,Y_pred_train)

Y_pred_test = regr.predict(X_test)
mse_test  = mean_squared_error(Y_test, Y_pred_test)
R2_test = r2_score(Y_test,Y_pred_test)

# Print results
print('Results linear multiple regression model')
print(f'Mean Squared Error Train | Test: \t{mse_train:>7.4f}\t|  {mse_test:>7.4f}')
print(f'R2                 Train | Test: \t{R2_train:>7.4f}\t|  {R2_test:>7.4f}\n')
print('Weights')
print(f'Intercept: \t\t\t {regr.intercept_:>7.4f}')
for n in range(len(regr.coef_)):
    print(f'Weight_{X_lin_extended.keys()[n]:10s} \t\t {regr.coef_[n]:>7.4f}')

# B 
# The extended model is probably in region B. We conclude this because we still see little overfitting going on. This suggests the model could further increase in complexity. 
# As a result, we can conclude the first model is probably in Region A. It fits worse that the extended model. This says that it is more biased.
Results linear multiple regression model
Mean Squared Error Train | Test: 	 0.0029	|   0.0028
R2                 Train | Test: 	 0.8251	|   0.8356

Weights
Intercept: 			 10.1865
Weight_AANT_INW   		 -0.9073
Weight_BEV_DICHTH 		 -0.0198
Weight_P_KOOPWON  		  0.0209
Weight_fys        		  0.8517
Weight_soc        		  0.7596
Weight_won        		  1.0501
Weight_onv        		  0.9182
Weight_P_15_24_JR 		 -0.0001
Weight_P_STERFT   		  0.0000
Weight_P_GESCHEID 		 -0.0023
Weight_STED       		 -0.0167
Weight_WOZ        		  0.0000
Weight_P_MAROKKO  		 -0.0003
Weight_P_ANT_ARU  		 -0.0061
Weight_P_SURINAM  		  0.0003
Weight_P_TURKIJE  		 -0.0015
Weight_P_1GEZW    		 -0.0614
Weight_P_MGEZW    		 -0.0607
Weight_OAD        		  0.0001
Weight_AANT_VROUW 		  0.0000